-
Notifications
You must be signed in to change notification settings - Fork 973
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new catchup mode to use transaction results to skip failed transaction and signature verification #4536
base: master
Are you sure you want to change the base?
Conversation
1116893
to
b03fc48
Compare
src/transactions/TransactionFrame.h
Outdated
@@ -70,6 +70,10 @@ class TransactionFrame : public TransactionFrameBase | |||
mutable Hash mFullHash; // the hash of the contents and the sig. | |||
|
|||
std::vector<std::shared_ptr<OperationFrame const>> mOperations; | |||
mutable std::optional<TransactionResult> mReplaySuccessfulTransactionResult{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please avoid adding mutable state to TransactionFrame - @SirTyson recently did a lot of work to make TransactionFrame immutable, so we should keep it that way. Reason is that tx frame is used across the codebase for overlay flooding as well as ledger application, so any incorrectly set mutable state leads to awful failure modes like state divergence. If you want to override results, could we modify MutableTransactionResult class instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, that makes sense. I'll change this to store the replay result in MutableTransactionResult.
@@ -1785,8 +1816,19 @@ TransactionFrame::apply(AppConnector& app, AbstractLedgerTxn& ltx, | |||
{ | |||
mCachedAccountPreProtocol8.reset(); | |||
uint32_t ledgerVersion = ltx.loadHeader().current().ledgerVersion; | |||
SignatureChecker signatureChecker{ledgerVersion, getContentsHash(), | |||
getSignatures(mEnvelope)}; | |||
auto skipMode = mReplayFailingTransactionResult.has_value() || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is correct: don't we still need to do signature verification for failed transactions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(A good sanity check for this check would be running full parallel catchup)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to update one time signers (which are removed whether the signature verification succeeds or fails), but I don't see why we need to verify signatures for failed transactions as it doesn't effect the ledger state.
src/transactions/SignatureChecker.h
Outdated
virtual bool checkSignature(std::vector<Signer> const&, int32_t) = 0; | ||
virtual bool checkAllSignaturesUsed() const = 0; | ||
}; | ||
class SignatureCheckerImpl : public SignatureChecker |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Request for some renaming for clarity:
- SignatureChecker -> AbstractSignatureChecker
- SignatureCheckerImpl -> SignatureChecker
src/catchup/ApplyCheckpointWork.h
Outdated
TransactionHistoryEntry mTxHistoryEntry; | ||
TransactionHistoryResultEntry mTxHistoryResultEntry; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid footguns, results should be optional
// to mark catchup as complete and node as synced. In OFFLINE mode node is not | ||
// connected to network, so new ledgers are not being externalized. Only | ||
// buckets and transactions from history archives are applied. | ||
// Catchup can be done in two modes - ONLINE and OFFLINE. In ONLINE mode, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did the formatting change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reworded it a bit, which changed the line lengths and formatting.
src/catchup/DownloadApplyTxsWork.cpp
Outdated
mCheckpointToQueue); | ||
auto getAndUnzip = | ||
std::make_shared<GetAndUnzipRemoteFileWork>(mApp, ft, mArchive); | ||
OnFailureCallback cb = [archive = mArchive, filesToTransfer]() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep the implementation of OnFailureCallback the way it was before: mArchive can be null, in which case GetAndUnzipRemoteFileWork
will pick a random archive. With the current change, core would crash de-referencing a null pointer.
src/catchup/DownloadApplyTxsWork.cpp
Outdated
@@ -98,8 +94,10 @@ DownloadApplyTxsWork::yieldMoreWork() | |||
{ | |||
auto prev = mLastYieldedWork; | |||
bool pqFellBehind = false; | |||
auto applyName = apply->getName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused
src/catchup/DownloadApplyTxsWork.cpp
Outdated
std::filesystem::remove( | ||
std::filesystem::path(ft.localPath_nogz())); | ||
CLOG_DEBUG(History, "Deleted transactions {}", | ||
CLOG_DEBUG(History, "Deleting transactions {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the log is misleading, since we're deleting transactions and results
src/ledger/LedgerManagerImpl.cpp
Outdated
@@ -1520,6 +1522,17 @@ LedgerManagerImpl::applyTransactions( | |||
|
|||
prefetchTransactionData(txs); | |||
|
|||
std::optional<std::vector<TransactionResultPair>::const_iterator> | |||
expectedResultsIter = std::nullopt; | |||
if (mApp.getConfig().CATCHUP_SKIP_KNOWN_RESULTS && expectedResults) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a valid scenario when CATCHUP_SKIP_KNOWN_RESULTS=true but expectedResults is not set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the CATCHUP_SKIP_KNOWN_RESULTS=true
, but for some reason we do not have expectedResults (e.g. if for some reason they aren't in the archive), catchup will proceed but no transactions will be skipped. I think I should probably add a warning to inform users that this is the case, but I'm hesitant to report an error in that case as its not fatal. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, this makes me realize: with the current implementation of catchup, won't it fail if results couldn't be downloaded? If so, I think we should make it less strict, so it'll try to download results, but if it can't, catchup should still proceed normally. In this case, I agree the logic here makes sense and we can add a warning (probably not at individual ledger level though as it'll get spammy. We can warn per checkpoint).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed WorkSequence
to take an vector of "optional" works, which, if they fail do not block the execution of subsequent works. This step now will log a warning every checkpoint if the downloading of results fails, but catchup will proceed.
src/ledger/LedgerManagerImpl.cpp
Outdated
{ | ||
releaseAssert(*expectedResultsIter != | ||
expectedResults->results.end()); | ||
while ((*expectedResultsIter)->transactionHash != |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a bit suspicious: the order of results should match the order of transactions application exactly. I think you can do expectedResults->at(i)
instead of the loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, the order should match, but the loop accounts for the case when there are gaps in the results stored in the archive, which I believe can be the case when there are ledgers with empty txn sets (?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you have ledgers with empty sets, then txs
should be empty as well, so we should use the same index logic for both results and txs
ff8f71a
to
c2bc090
Compare
// here. | ||
std::optional<std::vector<TransactionResultPair>::const_iterator> | ||
expectedResultsIter = std::nullopt; | ||
if (expectedResults) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add an invariant either here or in TransactionFrame that makes sure we're actually doing catchup when we see an expected result. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, something like releaseAssert(mApp.getCatchupManager().isCatchupInitialized());
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for CatchupManagerImpl::catchupWorkIsDone()
to be true along with a non null mCatchupWork
here? I don't recall how the catchup state transitions work. Maybe @marta-lokhova has some input on a safe way to do this check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this change targets a very specific use case (parallel catchup for testing), I'd recommend putting all this functionality behind BUILD_TESTS. The prod paths should remain unchanged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I thought this was meant for running catchup in general. Is it just for parallel catchup testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meta, events and metrics are not equivalent to live execution when this mode is enabled, making it unsuitable to general catchup use case. It could be used in all scenarios where one doesn't care about them. Its not limited to parallel catchup, but generally development/testing scenarios so I agree that it should be behind an ifdef BUILDTESTS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we don't win much by skipping work in prod paths. Different application flows increase the risk of divergence. Parallel catchup seems like a nice middle ground given the performance benefits + low risk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. In that case, a lot more of this PR should be behind the BUILD_TESTS ifdef.
…e to ifdef BUILDTEST
e2ad648
to
c451463
Compare
Description
Resolves #X
Adds a new config option,
CATCHUP_SKIP_KNOWN_RESULTS
. When this config option is enabled, transaction results are downloaded from history archives for the catchup range. Failed transactions are not applied, and signatures are not verified.Preliminary perf testing locally running catchup on 1000 ledgers:
Supercluster PubnetParallelCatchup with these changes completed in 14h 47min (vs ~24 hours with recent releases / master HEAD).
https://buildmeister-v3.stellar-ops.com/job/Core/job/stellar-supercluster/1055/
Checklist
clang-format
v8.0.0 (viamake format
or the Visual Studio extension)